Introduction

To re-cap - we have obtained our raw counts matrices from GEO (or cellranger). We have performed QC and subset the data according to the QC plots we generated. We have normalised the data via sctransform - regressing out unwanted sources of variation, and integrated across our datasets. Lastly, we have clustered the cells into separate groups based on the similarity of their expression profiles.

In the next part of the course we are going to explore the genes that show the most differences in their expression between clusters. The top genes showing the biggest change in their expression compared to the other clusters are often called marker genes. We can then use these marker genes to assign an identity class to our clusters of interest.

Learning Objectives

  • Find differentially expressed genes using FindMarkers() and FindAllMarkers().
  • Use the bioconductor human annotation package org.Hs.eg.db to add some more useful annotation to our marker genes.
  • Assign cluster identity using cell type canonical marker genes
  • Carry out a simple pathway analysis

Firstly, we will load up the packages we have been using so far for this part of the course.

library(Seurat, quietly = T)
library(tidyverse, quietly = T)

And then we can load up the R objects we have used in the previous sessions.

seurat <- readRDS("Robjects/seurat_integrated.RDS")

We can simply run the name of the Seurat object (in our case seurat) to give us a summary of what the object contains and what calculations have been performed on it.

seurat
An object of class Seurat 
31850 features across 3322 samples within 3 assays 
Active assay: integrated (3000 features, 3000 variable features)
 2 other assays present: RNA, SCT
 3 dimensional reductions calculated: pca, tsne, umap

We can recap our clusters by taking a look at the DimPlot

DimPlot(seurat, reduction = "umap")

Find differentially expressed genes

Our next objective is to start assigning some biological meaning to our data. We have some nice clusters but no clue yet what cell types our clusters contain.

In Seurat we can find the top differentially expressed genes -or markers - within each cluster. The function FindMarkers() handles most of the differential expression testing within the Seurat package and finds marker genes within a particular cluster. As a default, Seurat performs differential expression based on the non-parameteric Wilcoxon rank sum test but other methods are available within the function.

To start with we can find all the differentially expressed genes in cluster 3 compared to every other cluster. The min.pct argument requires a gene to be detected at a minimum percentage of all the cells.

Since we have performed integration in a previous step, we set the default assay back to ‘RNA’ to ensure we are using the raw counts, rather than counts transformed by integration, for our differential analysis.

# We make sure the default assay is set back to "RNA" to use the original counts for DE analysis.
DefaultAssay(seurat) <- "RNA"
cluster3.markers <- FindMarkers(seurat, 
                                ident.1 = 3, 
                                min.pct  = 0.25)
cluster3.markers %>% head

A quick look at the top 6 lines in our differentially expressed marker gene table shows us the format of the data. We have the results of the Wilcoxon rank sum test as a p value and adjusted p value for multiple testing. We also have our average log2 fold change in expression between our clusters. The pct.1 and pct.2 columns show us the percentage of cells within the cluster the gene is expressed in (we set a minimum threshold of 20% in the function). As default, the output table is sorted by adjusted p value.

A littlebit of table manipulation in dplyr makes the markers easier to view…

cluster3.markers  %>% 
  select(avg_log2FC, p_val, p_val_adj) %>%
  slice_max(n = 5, order_by = avg_log2FC) 

To check our DE findings, we can visualize expression of a particular gene of interest across all clusters using the VlnPlot() function. This shows the expression levels of a gene (or genes) of interest within the cells across all clusters.

VlnPlot(seurat, features = "LYZ")

We can also check which cells express our gene of interest on the UMAP image using FeaturePlot

FeaturePlot(seurat, features = "LYZ")

Perhaps we are most interested in the differentially expressed genes between clusters 1 and 3

cluster1.markers <- FindMarkers(seurat, 
                                ident.1 = 1, 
                                ident.2 = 3, 
                                min.pct = 0.25,verbose = FALSE)
cluster1.markers %>% head()

… or the DE genes between clusters 1 in comparison to 3 and 4?

cluster1.markers <- FindMarkers(seurat, ident.1 = 1, 
                                ident.2 = c(3,4), # we can group clusters together!
                                min.pct = 0.25,
                                verbose = FALSE)
cluster1.markers %>% head()

Usually, we are wanting to find the most differentially expressed genes in each cluster across the whole dataset. For this we use FindAllMarkers. This time we will only report the positively expressed markers with a minimum log fold change of 0.25.

markers <- FindAllMarkers(seurat, 
                          only.pos = TRUE, 
                          min.pct = 0.25, 
                          logfc.threshold = 0.25, 
                          verbose = F) # keeps it quiet!
markers

Again, a bit of manipulation in dplyr enables us to get the top 5 DE genes in each cluster.

markers %>%
  select(gene, cluster, avg_log2FC, p_val, p_val_adj) %>%
    group_by(cluster) %>% # important to look in each cluster!
    slice_max(n = 5, order_by = avg_log2FC)

Exercise

  • Use FindMarkers to find the differentially expressed genes in clusters 1 & 2 vs 8
  • Can you use dplyr to select columns of interest and order them by log fold change?
  • Use one of the visualisation tools to look at the expression of one of the top marker genes
  • Bonus! - We have not discussed the visualisation tool RidgePlot. See if you can use it to make another plot of one of the top marker genes. More info here - https://satijalab.org/seurat/articles/visualization_vignette.html
#Put exercise work in here :-)

Annotating marker genes

We may want to share our findings with collaborators or perform further bioinformatic analysis on our DE genes in clusters of interest. In this dataset, the gene level information is given as a gene name. Whilst this is useful we may want to embellish these results with some gene description information or some other gene IDs (eg Ensembl, Entrez).

The org.Hs.eg.db database is a genome wide annotation package for human that has many different gene mappings based on Entrez Gene identifiers. These are up to data databases of gene annotation that can be installed like any other bioconductor package

# Dont run this if you have run already!
#install.packages("BiocManager")
#BiocManager::install("org.Hs.eg.db")

library(org.Hs.eg.db)

In our case we are going to use the gene symbols (rownames in our Seurat object) to extract the gene description and entrez ID from the database.

We are going to filter the database by a key or set of keys in order to extract the information we want. Valid names for the key can be retrieved with the keytypes function.

keytypes(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL" 
[10] "GENENAME"     "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL" 
[19] "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIPROT"     

We are going to use the keytype ‘SYMBOL’ which designates the gene names that we have in our rownames of the Seurat object.

Unfortunately, the authors of dplyr and AnnotationDbi have both decided to use the name select in their packages. To avoid confusion and problems from packages attempting to use the wrong tool, the following code is sometimes used:-

AnnotationDbi::select which tells us to use the ‘select’ function from the AnnotationDBI package.

anno <- AnnotationDbi::select(org.Hs.eg.db,
                              keys=rownames(seurat),
                              columns=c("SYMBOL","GENENAME", "ENTREZID"),  # info we want to extract
                              keytype = "SYMBOL") # information we have

# Check it worked!
head(anno)  

Now we can combine our annotation table to our marker genes using dplyr and left_join

anno_markers <- markers %>% 
  dplyr::select(gene, cluster, avg_log2FC, p_val, p_val_adj) %>%
  left_join(anno, by = c("gene" = "SYMBOL"))

anno_markers %>% head()
NA

At this point we might want to save our result into a nicely formatted file where we can look up our clusters, top DE genes and their gene descriptions.

anno_markers %>% write_csv("Annotated.results.csv")

Assigning cluster identity

If you recall, our dataset is from the following paper:-

Caron M, St-Onge P, Sontag T, Wang YC et al. Single-cell analysis of childhood leukemia reveals a link between developmental states and ribosomal protein expression as a source of intra-individual heterogeneity. Sci Rep 2020 May15;10(1):8079

This paper uses the following canonical marker genes to assign cell cluster identity. This starts to give some biological meaning to our results.

These genes are known to be expressed in a particular cell type.

Gene Cell type
CD79A B Cells
CD34 CD34+,
MS4A1 CD20+ B cells,
MZB1 BCMA,
CST3 Monocytes,
SPN Im mature hematopoietic,
HBA1 erythrocytes,
CD3D T cells,
NKG7 NK cells

We can assign these marker genes as a variable

cell_markers <- c("CD79A","CD34", "MS4A1", "MZB1","CST3" ,"SPN", "KIT", "HBA1", "CD3D", "NKG7")

Dot plots allow us to examine particular features or genes of interest across all clusters. Creating a dotplot of the cell type markers lets us assign identity based on their expression.

DotPlot(seurat, features = cell_markers) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

We can then manually create new identities for each cluster as follows.

new.cluster.ids <- c("T Cells", #0
                     "T Cells", #1
                     "Erythrocytes", #2  
                     "Monocytes",  #3
                     "B Cells", #4
                     "CD34+", #5
                     "B Cells", #6
                     "NK Cells", #7
                     "Immature hema#8topoietic", 
                     "CD79A B cells",#9 
                     "Monocytes", #10
                     "CD20+ B cells", #11
                     "BCMA", #12
                     "Erythrocytes", #13 
                     "BCMA") #14
names(new.cluster.ids) <- levels(seurat)
new.cluster.ids
                         0                          1                          2                          3                          4 
                 "T Cells"                  "T Cells"             "Erythrocytes"                "Monocytes"                  "B Cells" 
                         5                          6                          7                          8                          9 
                   "CD34+"                  "B Cells"                 "NK Cells" "Immature hema#8topoietic"            "CD79A B cells" 
                        10                         11                         12                         13                         14 
               "Monocytes"            "CD20+ B cells"                     "BCMA"             "Erythrocytes"                     "BCMA" 
seurat_named <- RenameIdents(seurat, new.cluster.ids)
DimPlot(seurat_named, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

We can check our result to the UMAP plot produced from the full dataset within the paper - not too bad! Considering we have used a subset and not followed their exact method.

GO term and pathway enrichment

Sometimes we want to carry out a Gene Enrichment Analysis on a list of genes of interest. This enables us to find functional terms that are statistically over represented within a group of genes compared to a background list. clusterProfiler is an R package that provides statistical tests for expression analysis of terms such as GO (Gene Ontology), within gene lists that have shown statistically significant differences.

When using this tool we need to have two inputs: a list of genes that show significant differentiation in expression and a list of background genes. Hypergeometric tests are used for the statistical enrichment analysis.

We can check for GO terms measuring BP (biological process), CC (cell component) and MF (molecular function).

First of all we need to create our test genes and background list

library(clusterProfiler)
background <- anno %>% pull("ENTREZID")

go_markers <- anno_markers %>% filter(cluster ==c(0,1)) %>% pull(ENTREZID) # Lets look at clusters (1 + 2 -  T cells)

  eGO <- enrichGO(
    gene          = go_markers,
    universe = background, 
    OrgDb         = org.Hs.eg.db,
    ont           = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    readable      = TRUE
  )
  
head(eGO)

clusterProfiler also has functionality to plot pathway diagrams for us. It uses our new object created by enrichGO as input and highlights the pathways enriched in our dataset.

plotGOgraph(eGO)
$dag
A graphNEL graph with directed edges
Number of Nodes = 65 
Number of Edges = 109 

$complete.dag
[1] "A graph with 65 nodes."

---
title: "Analysis of Single-Cell RNA-seq - Session 4"
author: "Emily V Chambers"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output: 
  html_notebook: 
    toc: yes
    toc_float: yes
    css: stylesheets/styles.css
---
<img src="images/logo-sm.png" style="position:absolute;top:40px;right:10px;" width="200" />

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  tidy = TRUE,
  tidy.opts = list(width.cutoff = 95),
  message = FALSE,
  warning = FALSE,
  fig.width = 10,
  time_it = TRUE
)

```

# Introduction

To re-cap -  we have obtained our raw counts matrices from GEO (or cellranger). We have performed QC and subset the data according to the QC plots we generated. We have normalised the data via sctransform - regressing out unwanted sources of variation, and integrated across our datasets. Lastly, we have clustered the cells into separate groups based on the similarity of their expression profiles.

In the next part of the course we are going to explore the genes that show the most differences in their expression between clusters. The top genes showing the biggest change in their expression compared to the other clusters are often called *marker genes*. We can then use these marker genes to assign an *identity class* to our clusters of interest.

![](images/flowchart3.png)

## Learning Objectives
* Find differentially expressed genes using `FindMarkers()` and `FindAllMarkers()`.
* Use the bioconductor human annotation package `org.Hs.eg.db` to add some more useful annotation to our marker genes.
* Assign cluster identity using cell type canonical marker genes
* Carry out a simple pathway analysis

Firstly, we will load up the packages we have been using so far for this part of the course.

```{r}
library(Seurat, quietly = T)
library(tidyverse, quietly = T)
```
And then we can load up the R objects we have used in the previous sessions.
```{r}
seurat <- readRDS("Robjects/seurat_integrated.RDS")
```
We can simply run the name of the Seurat object (in our case `seurat`) to give us a summary of what the object contains and what calculations have been performed on it.
```{r}
seurat
```
We can recap our clusters by taking a look at the `DimPlot`
```{r}
DimPlot(seurat, reduction = "umap")
```


# Find differentially expressed genes
Our next objective is to start assigning some *biological meaning* to our data. We have some nice clusters but no clue yet what cell types our clusters contain.

In Seurat we can find the top differentially expressed genes -or markers - within each cluster. The function `FindMarkers()` handles most of the differential expression testing within the Seurat package and finds marker genes within a particular cluster. As a default, Seurat performs differential expression based on the non-parameteric Wilcoxon rank sum test but other methods are available within the function. 

To start with we can find all the differentially expressed genes in cluster 3 compared to every other cluster. The `min.pct` argument requires a gene to be detected at a minimum percentage of all the cells.

Since we have performed integration in a previous step, we set the default assay back to 'RNA' to ensure we are using the raw counts, rather than counts transformed by integration, for our differential analysis.
```{r}
# We make sure the default assay is set back to "RNA" to use the original counts for DE analysis.
DefaultAssay(seurat) <- "RNA"
```


```{r}
cluster3.markers <- FindMarkers(seurat, 
                                ident.1 = 3, 
                                min.pct  = 0.25)
cluster3.markers %>% head
```
A quick look at the top 6 lines in our differentially expressed marker gene table shows us the format of the data. We have the results of the Wilcoxon rank sum test as a p value and adjusted p value for multiple testing. We also have our average log2 fold change in expression between our clusters. The pct.1 and pct.2 columns show us the percentage of cells within the cluster the gene is expressed in (we set a minimum threshold of 20% in the function). As default, the output table is sorted by adjusted p value.

A littlebit of table manipulation in `dplyr` makes the markers easier to view...
```{r}
cluster3.markers  %>% 
  select(avg_log2FC, p_val, p_val_adj) %>%
  slice_max(n = 5, order_by = avg_log2FC) 
```
To check our DE findings, we can visualize expression of a particular gene of interest across all clusters using the `VlnPlot()` function. This shows the expression levels of a gene (or genes) of interest within the cells across all clusters.

```{r}
VlnPlot(seurat, features = "LYZ")
```
We can also check which cells express our gene of interest on the UMAP image using `FeaturePlot`

```{r}
FeaturePlot(seurat, features = "LYZ")
```
Perhaps we are most interested in the differentially expressed genes between clusters 1 and 3

```{r message=FALSE}
cluster1.markers <- FindMarkers(seurat, 
                                ident.1 = 1, 
                                ident.2 = 3, 
                                min.pct = 0.25,verbose = FALSE)
cluster1.markers %>% head()
```
... or the DE genes between clusters 1 in comparison to 3 and 4?
```{r}
cluster1.markers <- FindMarkers(seurat, ident.1 = 1, 
                                ident.2 = c(3,4), # we can group clusters together!
                                min.pct = 0.25,
                                verbose = FALSE)
cluster1.markers %>% head()
```
Usually, we are wanting to find the most differentially expressed genes in each cluster across the whole dataset. For this we use `FindAllMarkers`. This time we will only report the positively expressed markers with a minimum log fold change of 0.25.

```{r}
markers <- FindAllMarkers(seurat, 
                          only.pos = TRUE, 
                          min.pct = 0.25, 
                          logfc.threshold = 0.25, 
                          verbose = F) # keeps it quiet!
markers
```

Again, a bit of manipulation in `dplyr` enables us to get the top 5 DE genes in each cluster.
```{r}
markers %>%
  select(gene, cluster, avg_log2FC, p_val, p_val_adj) %>%
    group_by(cluster) %>% # important to look in each cluster!
    slice_max(n = 5, order_by = avg_log2FC)
```
## Exercise

<div class="exercise">
- Use `FindMarkers` to find the differentially expressed genes in clusters 1 & 2 vs 8
- Can you use `dplyr` to select columns of interest and order them by log fold change?
- Use one of the visualisation tools to look at the expression of one of the top marker genes
- Bonus! - We have not discussed the visualisation tool `RidgePlot`. See if you can use it to make another plot of one of the top marker genes. More info here - https://satijalab.org/seurat/articles/visualization_vignette.html
</div>

```{r}
#Put exercise work in here :-)


```


# Annotating marker genes

We may want to share our findings with collaborators or perform further bioinformatic analysis on our DE genes in clusters of interest. In this dataset, the gene level information is given as a gene name. Whilst this is useful we may want to embellish these results with some gene description information or some other gene IDs (eg Ensembl, Entrez).

The `org.Hs.eg.db` database is a genome wide annotation package for human that has many different gene mappings based on Entrez Gene identifiers. These are up to data databases of gene annotation that can be installed like any other bioconductor package

```{r}
# Dont run this if you have run already!
#install.packages("BiocManager")
#BiocManager::install("org.Hs.eg.db")

library(org.Hs.eg.db)
```

In our case we are going to use the gene symbols (rownames in our Seurat object) to extract the gene description and entrez ID from the database.

We are going to filter the database by a key or set of keys in order to extract the information we want. Valid names for the key can be retrieved with the `keytypes` function.

```{r}
keytypes(org.Hs.eg.db)
```
We are going to use the keytype 'SYMBOL' which designates the gene names that we have in our rownames of the Seurat object.

Unfortunately, the authors of `dplyr` and `AnnotationDbi` have both decided to use the name `select` in their packages. To avoid confusion and problems from packages attempting to use the wrong tool, the following code is sometimes used:-

`AnnotationDbi::select` which tells us to use the 'select' function from the `AnnotationDBI` package.
```{r}
anno <- AnnotationDbi::select(org.Hs.eg.db,
                              keys=rownames(seurat),
                              columns=c("SYMBOL","GENENAME", "ENTREZID"),  # info we want to extract
                              keytype = "SYMBOL") # information we have

# Check it worked!
head(anno)  
```
Now we can combine our annotation table to our marker genes using `dplyr` and `left_join`
```{r}
anno_markers <- markers %>% 
  dplyr::select(gene, cluster, avg_log2FC, p_val, p_val_adj) %>%
  left_join(anno, by = c("gene" = "SYMBOL"))

anno_markers %>% head()

```
At this point we might want to save our result into a nicely formatted file where we can look up our clusters, top DE genes and their gene descriptions.

```{r}
anno_markers %>% write_csv("Annotated.results.csv")
```

# Assigning cluster identity

If you recall, our dataset is from the following paper:-

> Caron M, St-Onge P, Sontag T, Wang YC et al. Single-cell analysis of childhood leukemia reveals a link between developmental states and ribosomal protein expression as a source of intra-individual heterogeneity. Sci Rep 2020 May15;10(1):8079

This paper uses the following *canonical marker genes* to assign cell cluster identity. This starts to give some *biological meaning* to our results.

These genes are known to be expressed in a particular cell type.

Gene    Cell type
----    ----
CD79A   B Cells
CD34    CD34+, 
MS4A1   CD20+ B cells, 
MZB1    BCMA,
CST3    Monocytes, 
SPN   Immature hematopoietic, 
HBA1    erythrocytes, 
CD3D    T cells, 
NKG7    NK cells


We can assign these marker genes as a variable
```{r}
cell_markers <- c("CD79A","CD34", "MS4A1", "MZB1","CST3" ,"SPN", "KIT", "HBA1", "CD3D", "NKG7")
```

Dot plots allow us to examine particular features or genes of interest across all clusters. Creating a dotplot of the cell type markers lets us assign identity based on their expression.
```{r}
DotPlot(seurat, features = cell_markers) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
We can then manually create new identities for each cluster as follows.

```{r}
new.cluster.ids <- c("T Cells", #0
                     "T Cells", #1
                     "Erythrocytes", #2  
                     "Monocytes",  #3
                     "B Cells", #4
                     "CD34+", #5
                     "B Cells", #6
                     "NK Cells", #7
                     "Immature hema#8topoietic", 
                     "CD79A B cells",#9 
                     "Monocytes", #10
                     "CD20+ B cells", #11
                     "BCMA", #12
                     "Erythrocytes", #13 
                     "BCMA") #14
names(new.cluster.ids) <- levels(seurat)
new.cluster.ids
seurat_named <- RenameIdents(seurat, new.cluster.ids)
```

```{r}
DimPlot(seurat_named, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
```
We can check our result to the UMAP plot produced from the full dataset within the paper - not too bad! Considering we have used a subset and not followed their exact method.
![](images/umaps.png)



# GO term and pathway enrichment
Sometimes we want to carry out a *Gene Enrichment Analysis* on a list of genes of interest. This enables us to find *functional terms* that are statistically over represented within a group of genes compared to a background list. `clusterProfiler` is an R package that provides statistical tests for expression analysis of terms such as GO (Gene Ontology), within gene lists that have shown statistically significant differences. 

When using this tool we need to have two inputs: a list of genes that show significant differentiation in expression and a list of background genes. Hypergeometric tests are used for the statistical enrichment analysis. 

We can check for GO terms measuring BP (biological process), CC (cell component) and MF (molecular function).

First of all we need to create our test genes and background list

```{r}
library(clusterProfiler)
background <- anno %>% pull("ENTREZID")

go_markers <- anno_markers %>% filter(cluster ==c(0,1)) %>% pull(ENTREZID) # Lets look at clusters (1 + 2 -  T cells)

  eGO <- enrichGO(
    gene          = go_markers,
    universe = background, 
    OrgDb         = org.Hs.eg.db,
    ont           = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    readable      = TRUE
  )
  
head(eGO)
```
`clusterProfiler` also has functionality to plot pathway diagrams for us. It uses our new object created by `enrichGO` as input and highlights the pathways enriched in our dataset.

```{r}
plotGOgraph(eGO)
```



